Double Pareto Lognormal distribution (dpareto_lognorm)#
The double Pareto lognormal (dPlN) distribution is a continuous distribution on \((0,\infty)\) that combines:
a lognormal “body” (multiplicative variability / proportional growth), and
Pareto (power-law) tails (extremes that are rare in absolute terms, but far more frequent than a lognormal would predict).
It’s a strong candidate for positive size variables that look roughly lognormal but show unusually large outliers (e.g., income/wealth, firm size, city size, file sizes, web traffic).
Learning goals#
Classify the distribution and interpret its parameters \((\mu,\sigma,\alpha,\beta)\).
Write down the PDF and CDF (and implement stable logpdf/logcdf).
Derive moments and understand when they are finite.
Implement a fast NumPy-only sampler using a simple generative representation.
Use SciPy’s implementation (scipy.stats.dpareto_lognorm) for inference and fitting.
import numpy as np
import plotly
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
import scipy
from scipy import special
from scipy.stats import dpareto_lognorm as dpln_dist
from scipy.stats import lognorm, norm
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=6, suppress=True)
rng = np.random.default_rng(42)
# Record versions for reproducibility (useful when numerical details matter).
VERSIONS = {"numpy": np.__version__, "scipy": scipy.__version__, "plotly": plotly.__version__}
VERSIONS
{'numpy': '1.26.2', 'scipy': '1.15.0', 'plotly': '6.5.2'}
1) Title & Classification#
Name: dpareto_lognorm (Double Pareto Lognormal; SciPy: scipy.stats.dpareto_lognorm)
Type: Continuous
Support (standard form): \(x \in (0,\infty)\)
Parameter space (standard form):
\(\mu \in \mathbb{R}\)
\(\sigma > 0\)
\(\alpha > 0\)
\(\beta > 0\)
SciPy shape parameters: \(u=\mu\), \(s=\sigma\), \(a=\alpha\), \(b=\beta\)
SciPy location/scale: \(\text{loc} \in \mathbb{R}\), \(\text{scale} > 0\), with
$$X = \text{loc} + \text{scale}\,Y, \qquad Y \sim \mathrm{dPlN}(\mu,\sigma,\alpha,\beta).$$
Unless stated otherwise, this notebook uses the standard form (loc=0, scale=1).
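As a minimal check of this convention (a sketch; the particular numbers are arbitrary), the pdf of \(\text{loc} + \text{scale}\,Y\) is \(f_Y((x-\text{loc})/\text{scale})/\text{scale}\):
# Sketch: loc/scale follow the usual SciPy convention.
import numpy as np
from scipy.stats import dpareto_lognorm
x0 = 1.7
p_std = dpareto_lognorm.pdf(x0, 0.2, 0.7, 2.5, 1.2)  # standard form (loc=0, scale=1)
p_ls = dpareto_lognorm.pdf(3.0 + 2.0 * x0, 0.2, 0.7, 2.5, 1.2, loc=3.0, scale=2.0)
print(np.isclose(p_ls, p_std / 2.0))  # True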
2) Intuition & Motivation#
What it models#
Many real-world positive quantities are well-modeled by a lognormal in the middle (because they arise from multiplicative effects), but have heavier tails than a lognormal:
too many very large observations (e.g., “winner-take-most” phenomena)
sometimes extra mass near 0 (many very small values)
The double Pareto lognormal addresses this by keeping a lognormal-like bulk while giving both tails power-law behavior.
A particularly useful generative representation is
$$\log X = Z + \frac{E_1}{\alpha} - \frac{E_2}{\beta}, \qquad Z \sim \mathcal{N}(\mu,\sigma^2), \quad E_1, E_2 \sim \mathrm{Exp}(1) \text{ independent}.$$
Equivalently (exponentiating),
$$X = L \cdot \frac{V_1}{V_2}, \qquad L \sim \mathrm{Lognormal}(\mu,\sigma^2), \quad V_1 \sim \mathrm{Pareto}(\alpha), \quad V_2 \sim \mathrm{Pareto}(\beta),$$
where SciPy’s Pareto here corresponds to support \([1,\infty)\) with tail \(\mathbb{P}(V>v)=v^{-\alpha}\) (and analogously \(v^{-\beta}\) for \(V_2\)).
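A quick numerical check of the Pareto building block (a sketch we add here; rng_chk and alpha_chk are our names): if \(E\sim\mathrm{Exp}(1)\), then \(e^{E/\alpha}\) has exactly this Pareto tail.
# Sketch: exp(E/alpha) with E ~ Exp(1) is Pareto(alpha), P(V > v) = v**(-alpha).
import numpy as np
rng_chk = np.random.default_rng(0)
alpha_chk = 2.5
V = np.exp(rng_chk.standard_exponential(100_000) / alpha_chk)
v = 3.0
print(np.mean(V > v), v ** (-alpha_chk))  # both ≈ 0.064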
Typical real-world use cases#
Income / wealth distributions (lognormal-ish center with Pareto upper tail)
Firm sizes, city sizes, file sizes
Web traffic, sales volumes, and other “size” variables with many small values and occasional extremes
Relations to other distributions#
Lognormal: the “body” looks lognormal when tails are not too heavy.
Pareto: the far right tail behaves like a Pareto with index \(\alpha\).
Laplace on log-scale: the term \(E_1/\alpha - E_2/\beta\) is an asymmetric Laplace random variable, so \(\log X\) is Normal + (asymmetric) Laplace.
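A Monte Carlo check of this decomposition (a sketch; mu_r etc. are our names): since \(\mathbb{E}[E_i]=\operatorname{Var}(E_i)=1\), we expect \(\mathbb{E}[\log X]=\mu+1/\alpha-1/\beta\) and \(\operatorname{Var}(\log X)=\sigma^2+1/\alpha^2+1/\beta^2\).
# Sketch: moments of log X from the Normal + asymmetric-Laplace decomposition.
import numpy as np
from scipy.stats import dpareto_lognorm
mu_r, sigma_r, alpha_r, beta_r = 0.2, 0.7, 2.5, 1.2
lx = np.log(dpareto_lognorm.rvs(mu_r, sigma_r, alpha_r, beta_r, size=200_000,
                                random_state=np.random.default_rng(3)))
print(lx.mean(), mu_r + 1 / alpha_r - 1 / beta_r)             # ≈ -0.233
print(lx.var(), sigma_r**2 + 1 / alpha_r**2 + 1 / beta_r**2)  # ≈ 1.344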
3) Formal Definition#
Let \(\phi\) and \(\Phi\) denote the standard normal PDF and CDF:
$$\phi(z) = \frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}, \qquad \Phi(z) = \int_{-\infty}^{z} \phi(t)\,dt.$$
Define
$$z = \frac{\log x - \mu}{\sigma}, \qquad y_1 = \alpha\sigma - z, \qquad y_2 = \beta\sigma + z,$$
and the (normal) Mills ratio
$$R(t) = \frac{1-\Phi(t)}{\phi(t)} = \frac{\Phi(-t)}{\phi(t)}.$$
PDF#
For \(x>0\),
$$f(x) = \frac{\alpha\beta}{\alpha+\beta}\,\frac{\phi(z)}{x}\,\bigl[R(y_1) + R(y_2)\bigr].$$
CDF#
For \(x>0\),
$$F(x) = \Phi(z) - \phi(z)\,\frac{\beta\,R(y_1) - \alpha\,R(y_2)}{\alpha+\beta},$$
and \(F(x)=0\) for \(x\le 0\) in the standard form.
Quantile function (PPF)#
There is no simple closed form for the PPF; SciPy computes it numerically.
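For illustration only (not SciPy’s actual internals), here is a sketch of a numerical PPF via root-finding; dpln_ppf_numeric and the log-space bracket \([-40, 40]\) are our own assumptions.
# Sketch: numerical PPF by inverting the CDF with a root-finder.
import numpy as np
from scipy.optimize import brentq
from scipy.stats import dpareto_lognorm

def dpln_ppf_numeric(q: float, mu: float, sigma: float, alpha: float, beta: float) -> float:
    """Solve F(x) = q by bracketing log(x) (illustrative only)."""
    f = lambda logx: dpareto_lognorm.cdf(np.exp(logx), mu, sigma, alpha, beta) - q
    return np.exp(brentq(f, -40.0, 40.0))

print(dpln_ppf_numeric(0.9, 0.2, 0.7, 2.5, 1.2))
print(dpareto_lognorm.ppf(0.9, 0.2, 0.7, 2.5, 1.2))  # should agree closely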
LOG_SQRT_2PI = 0.5 * np.log(2.0 * np.pi)
def log_phi(z: np.ndarray) -> np.ndarray:
"""Log of standard normal PDF."""
z = np.asarray(z, dtype=float)
return -0.5 * z**2 - LOG_SQRT_2PI
def log_Phi(z: np.ndarray) -> np.ndarray:
"""Log of standard normal CDF (stable)."""
z = np.asarray(z, dtype=float)
return special.log_ndtr(z)
def log_Phic(z: np.ndarray) -> np.ndarray:
"""Log of standard normal survival function 1-Φ(z) = Φ(-z) (stable)."""
z = np.asarray(z, dtype=float)
return special.log_ndtr(-z)
def log_R(t: np.ndarray) -> np.ndarray:
"""Log Mills ratio R(t) = (1-Φ(t))/φ(t), computed stably."""
t = np.asarray(t, dtype=float)
return log_Phic(t) - log_phi(t)
def log1mexp(a: np.ndarray) -> np.ndarray:
"""Compute log(1 - exp(a)) for a <= 0 in a numerically stable way."""
a = np.asarray(a, dtype=float)
if np.any(a > 0):
raise ValueError("log1mexp is only defined for a <= 0")
out = np.empty_like(a)
cutoff = -np.log(2.0)
mask = a < cutoff
out[mask] = np.log1p(-np.exp(a[mask]))
out[~mask] = np.log(-np.expm1(a[~mask]))
return out
def dpln_logpdf(x: np.ndarray, mu: float, sigma: float, alpha: float, beta: float) -> np.ndarray:
"""Log-PDF of the standard dPlN(μ,σ,α,β) distribution (loc=0, scale=1)."""
x = np.asarray(x, dtype=float)
out = np.full_like(x, fill_value=-np.inf, dtype=float)
mask = x > 0
xm = x[mask]
z = (np.log(xm) - mu) / sigma
y1 = alpha * sigma - z
y2 = beta * sigma + z
out[mask] = (
np.log(alpha)
+ np.log(beta)
- np.log(alpha + beta)
- np.log(xm)
+ log_phi(z)
+ np.logaddexp(log_R(y1), log_R(y2))
)
return out
def dpln_pdf(x: np.ndarray, mu: float, sigma: float, alpha: float, beta: float) -> np.ndarray:
"""PDF of the standard dPlN(μ,σ,α,β) distribution."""
return np.exp(dpln_logpdf(x, mu, sigma, alpha, beta))
def dpln_logcdf(x: np.ndarray, mu: float, sigma: float, alpha: float, beta: float) -> np.ndarray:
"""Log-CDF of the standard dPlN(μ,σ,α,β) distribution (stable)."""
x = np.asarray(x, dtype=float)
out = np.full_like(x, fill_value=-np.inf, dtype=float)
mask = x > 0
xm = x[mask]
z = (np.log(xm) - mu) / sigma
y1 = alpha * sigma - z
y2 = beta * sigma + z
t1 = log_Phi(z)
t2 = log_phi(z)
t3 = np.log(beta) + log_R(y1)
t4 = np.log(alpha) + log_R(y2)
# log(beta*R(y1) - alpha*R(y2)) with sign tracking.
t5, sign = special.logsumexp(
np.stack([t3, t4], axis=0),
b=np.stack([np.ones_like(t3), -np.ones_like(t3)], axis=0),
axis=0,
return_sign=True,
)
temp = np.stack([t1, t2 + t5 - np.log(alpha + beta)], axis=0)
out[mask] = special.logsumexp(
temp,
b=np.stack([np.ones_like(t1), -sign], axis=0),
axis=0,
)
return out
def dpln_cdf(x: np.ndarray, mu: float, sigma: float, alpha: float, beta: float) -> np.ndarray:
"""CDF of the standard dPlN(μ,σ,α,β) distribution."""
return np.exp(dpln_logcdf(x, mu, sigma, alpha, beta))
def dpln_logsf(x: np.ndarray, mu: float, sigma: float, alpha: float, beta: float) -> np.ndarray:
"""Log survival function log(1-F(x)), computed stably via log1mexp."""
return log1mexp(dpln_logcdf(x, mu, sigma, alpha, beta))
def dpln_sf(x: np.ndarray, mu: float, sigma: float, alpha: float, beta: float) -> np.ndarray:
"""Survival function 1-F(x)."""
return np.exp(dpln_logsf(x, mu, sigma, alpha, beta))
# Sanity check: our formulas match SciPy.
mu, sigma, alpha, beta = 0.2, 0.7, 2.5, 1.2
x = np.logspace(-4, 4, 200)
assert np.allclose(dpln_pdf(x, mu, sigma, alpha, beta), dpln_dist.pdf(x, mu, sigma, alpha, beta))
assert np.allclose(dpln_cdf(x, mu, sigma, alpha, beta), dpln_dist.cdf(x, mu, sigma, alpha, beta))
assert np.allclose(dpln_logpdf(x, mu, sigma, alpha, beta), dpln_dist.logpdf(x, mu, sigma, alpha, beta))
assert np.allclose(dpln_logcdf(x, mu, sigma, alpha, beta), dpln_dist.logcdf(x, mu, sigma, alpha, beta))
4) Moments & Properties#
Tail behavior (why “double Pareto”)#
The distribution has power-law tails:
As \(x\to\infty\), the right tail behaves like \(\mathbb{P}(X>x) \sim C\,x^{-\alpha}\), so the PDF decays like \(x^{-\alpha-1}\).
As \(x\to 0^+\), the left tail behaves like \(\mathbb{P}(X\le x) \sim C'\,x^{\beta}\), so the PDF behaves like \(x^{\beta-1}\) near 0.
This is why \(\alpha\) controls upper extremes, while \(\beta\) controls how much mass sits near very small values.
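We can check both exponents numerically from log–log slopes of the survival function and CDF (a quick sketch; the parameter values are ours):
# Sketch: estimate tail exponents from log-log slopes of SF (right) and CDF (left).
import numpy as np
from scipy.stats import dpareto_lognorm
mu_t, sigma_t, alpha_t, beta_t = 0.0, 0.6, 3.0, 2.0
x_hi = np.array([1e2, 1e3])
slope_right = (np.diff(np.log(dpareto_lognorm.sf(x_hi, mu_t, sigma_t, alpha_t, beta_t)))
               / np.diff(np.log(x_hi)))
x_lo = np.array([1e-3, 1e-2])
slope_left = (np.diff(np.log(dpareto_lognorm.cdf(x_lo, mu_t, sigma_t, alpha_t, beta_t)))
              / np.diff(np.log(x_lo)))
print(slope_right)  # ≈ -alpha = -3
print(slope_left)   # ≈ +beta = +2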
Mean, variance, skewness, kurtosis#
All positive moments have a clean form. For \(k<\alpha\),
$$\mathbb{E}[X^k] = \frac{\alpha\beta}{(\alpha-k)(\beta+k)}\,\exp\!\Bigl(k\mu + \tfrac{1}{2}k^2\sigma^2\Bigr).$$
Consequences:
Mean exists iff \(\alpha>1\).
Variance exists iff \(\alpha>2\).
Skewness exists iff \(\alpha>3\).
Kurtosis exists iff \(\alpha>4\).
(Similarly, negative moments exist up to order \(\beta\).)
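To see non-existence in action (a Monte Carlo sketch we add here): with \(\alpha=1.5\) the mean is finite but the variance is not, so the sample variance never stabilizes.
# Sketch: with alpha = 1.5 the sample mean settles but the sample variance
# keeps drifting as N grows, reflecting the infinite second moment.
import numpy as np
from scipy.stats import dpareto_lognorm
rng_demo = np.random.default_rng(1)
for N in [10_000, 100_000, 1_000_000]:
    xs = dpareto_lognorm.rvs(0.0, 0.6, 1.5, 2.0, size=N, random_state=rng_demo)
    print(f"N={N:>9,}  mean={xs.mean():8.4f}  var={xs.var():12.2f}")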
MGF / characteristic function#
The MGF \(M_X(t)=\mathbb{E}[e^{tX}]\) does not exist for any \(t>0\) because the right tail is polynomial.
The Laplace transform exists for \(t<0\) (see the numerical check after this list).
The characteristic function \(\varphi_X(t)=\mathbb{E}[e^{itX}]\) exists for all real \(t\) (bounded integrand), but there is no simple elementary closed form.
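For \(t<0\) the integrand \(e^{tx}f(x)\) decays, so the Laplace transform can be evaluated by numerical integration; below is a sketch using the generic expect method (the parameter values and t are ours).
# Sketch: numerically evaluate the Laplace transform E[exp(t X)] for t < 0.
import numpy as np
from scipy.stats import dpareto_lognorm
dist_l = dpareto_lognorm(0.2, 0.6, 6.0, 2.0)
t = -1.0
print(dist_l.expect(lambda x: np.exp(t * x)))  # finite, in (0, 1)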
Entropy#
The differential entropy \(h(X)=-\mathbb{E}[\log f(X)]\) is finite for typical parameters and is available via scipy.stats.dpareto_lognorm.entropy (computed numerically).
def dpln_raw_moment(k: float, mu: float, sigma: float, alpha: float, beta: float) -> float:
"""Raw moment E[X^k] for k < alpha (standard form)."""
if alpha <= k:
return np.nan
return (alpha * beta) / ((alpha - k) * (beta + k)) * np.exp(k * mu + 0.5 * (k * sigma) ** 2)
def dpln_mvsk_from_raw(mu: float, sigma: float, alpha: float, beta: float):
"""Return (mean, var, skew, kurt_excess) when moments exist; else NaNs."""
m1 = dpln_raw_moment(1, mu, sigma, alpha, beta)
m2 = dpln_raw_moment(2, mu, sigma, alpha, beta)
m3 = dpln_raw_moment(3, mu, sigma, alpha, beta)
m4 = dpln_raw_moment(4, mu, sigma, alpha, beta)
mean = m1
var = m2 - m1**2
# Central moments via raw moments.
mu3 = m3 - 3 * m2 * m1 + 2 * m1**3
mu4 = m4 - 4 * m3 * m1 + 6 * m2 * m1**2 - 3 * m1**4
skew = mu3 / (var ** 1.5)
kurt_excess = mu4 / (var**2) - 3.0
return mean, var, skew, kurt_excess
mu, sigma, alpha, beta = 0.2, 0.6, 6.0, 2.0
mean_f, var_f, skew_f, kurt_f = dpln_mvsk_from_raw(mu, sigma, alpha, beta)
mean_s, var_s, skew_s, kurt_s = dpln_dist.stats(mu, sigma, alpha, beta, moments="mvsk")
entropy_s = dpln_dist.entropy(mu, sigma, alpha, beta)
print("Moments from closed-form raw moments:")
print(" mean =", mean_f)
print(" var =", var_f)
print(" skew =", skew_f)
print(" kurt =", kurt_f)
print()
print("Moments from SciPy:")
print(" mean =", mean_s)
print(" var =", var_s)
print(" skew =", skew_s)
print(" kurt =", kurt_s)
print(" entropy=", entropy_s)
# Monte Carlo check (should be close when moments are finite).
N = 80_000
x_mc = dpln_dist.rvs(mu, sigma, alpha, beta, size=N, random_state=rng)
print()
print("Monte Carlo (N=80k):")
print(" mean ≈", x_mc.mean())
print(" var ≈", x_mc.var())
print(" entropy ≈", -dpln_dist.logpdf(x_mc, mu, sigma, alpha, beta).mean())
Moments from closed-form raw moments:
mean = 1.1698276715473797
var = 0.9301438713517878
skew = 2.787565892454241
kurt = 18.307973079452186
Moments from SciPy:
mean = 1.1698276715473797
var = 0.9301438713517878
skew = 2.7875658924542424
kurt = 18.307973079452186
entropy= 1.0436993462234498
Monte Carlo (N=80k):
mean ≈ 1.1713176076871294
var ≈ 0.9315080413645616
entropy ≈ 1.044800564368963
5) Parameter Interpretation#
\(\mu\) (log-location) shifts the distribution multiplicatively: increasing \(\mu\) by \(\Delta\mu\) scales typical values by \(e^{\Delta\mu}\) (see the quick check after this list).
\(\sigma\) (log-scale) increases spread in the lognormal “body” and also amplifies variability in both tails.
\(\alpha\) (right-tail index) controls how fast the upper tail decays:
smaller \(\alpha\) → heavier right tail (more extremes)
moments exist only up to order \(<\alpha\)
\(\beta\) (left-tail index) controls behavior near 0:
smaller \(\beta\) → more mass near 0 (and stronger divergence of the density as \(x\to 0^+\) if \(\beta<1\))
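Quick check referenced above (a sketch): shifting \(\mu\) by \(d\) multiplies every quantile by exactly \(e^{d}\).
# Sketch: increasing mu by d multiplies all quantiles by exp(d).
import numpy as np
from scipy.stats import dpareto_lognorm
d = 0.5
q_base = dpareto_lognorm.ppf(0.5, 0.0, 0.6, 3.0, 2.0)
q_shift = dpareto_lognorm.ppf(0.5, 0.0 + d, 0.6, 3.0, 2.0)
print(q_shift / q_base, np.exp(d))  # both ≈ 1.6487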
Below we visualize how the PDF changes when varying one parameter at a time.
# Parameter sweeps: how the PDF changes.
x = np.logspace(-3, 3, 600)
base = dict(mu=0.0, sigma=0.6, alpha=3.0, beta=2.0)
# 1) Vary alpha (right tail)
fig_alpha = go.Figure()
for a in [1.5, 3.0, 8.0]:
y = dpln_dist.pdf(x, base["mu"], base["sigma"], a, base["beta"])
fig_alpha.add_trace(go.Scatter(x=x, y=y, mode="lines", name=f"alpha={a}"))
fig_alpha.update_xaxes(type="log", title="x")
fig_alpha.update_yaxes(type="log", title="pdf(x)")
fig_alpha.update_layout(title="Effect of alpha: right-tail heaviness")
fig_alpha.show()
# 2) Vary beta (left tail)
fig_beta = go.Figure()
for b in [0.5, 1.5, 4.0]:
y = dpln_dist.pdf(x, base["mu"], base["sigma"], base["alpha"], b)
fig_beta.add_trace(go.Scatter(x=x, y=y, mode="lines", name=f"beta={b}"))
fig_beta.update_xaxes(type="log", title="x")
fig_beta.update_yaxes(type="log", title="pdf(x)")
fig_beta.update_layout(title="Effect of beta: mass near 0")
fig_beta.show()
# 3) Vary sigma (lognormal spread)
fig_sigma = go.Figure()
for s in [0.2, 0.6, 1.0]:
y = dpln_dist.pdf(x, base["mu"], s, base["alpha"], base["beta"])
fig_sigma.add_trace(go.Scatter(x=x, y=y, mode="lines", name=f"sigma={s}"))
fig_sigma.update_xaxes(type="log", title="x")
fig_sigma.update_yaxes(type="log", title="pdf(x)")
fig_sigma.update_layout(title="Effect of sigma: lognormal-body spread")
fig_sigma.show()
6) Derivations#
6.1 Expectation (general moment)#
Using the generative form
$$\log X = Z + \frac{E_1}{\alpha} - \frac{E_2}{\beta},$$
we get for any real \(k\) (when the expectations exist)
$$\mathbb{E}[X^k] = \mathbb{E}\!\left[\exp\!\Bigl(kZ + \tfrac{k}{\alpha}E_1 - \tfrac{k}{\beta}E_2\Bigr)\right].$$
Independence implies
$$\mathbb{E}[X^k] = \mathbb{E}\bigl[e^{kZ}\bigr]\,\mathbb{E}\bigl[e^{(k/\alpha)E_1}\bigr]\,\mathbb{E}\bigl[e^{-(k/\beta)E_2}\bigr].$$
Now:
If \(Z\sim\mathcal{N}(\mu,\sigma^2)\) then \(\mathbb{E}[e^{kZ}] = \exp\left(k\mu + \tfrac{1}{2}k^2\sigma^2\right)\).
If \(E\sim\mathrm{Exp}(1)\) then \(\mathbb{E}[e^{tE}] = \frac{1}{1-t}\) for \(t<1\).
Therefore, for \(k<\alpha\),
$$\mathbb{E}\bigl[e^{(k/\alpha)E_1}\bigr] = \frac{\alpha}{\alpha-k}, \qquad \mathbb{E}\bigl[e^{-(k/\beta)E_2}\bigr] = \frac{\beta}{\beta+k},$$
and multiplying gives
$$\mathbb{E}[X^k] = \frac{\alpha\beta}{(\alpha-k)(\beta+k)}\,\exp\!\Bigl(k\mu + \tfrac{1}{2}k^2\sigma^2\Bigr).$$
Setting \(k=1\) yields the mean (when \(\alpha>1\)).
6.2 Variance#
When \(\alpha>2\),
$$\operatorname{Var}(X) = \mathbb{E}[X^2] - \bigl(\mathbb{E}[X]\bigr)^2 = \frac{\alpha\beta\,e^{2\mu+2\sigma^2}}{(\alpha-2)(\beta+2)} - \left(\frac{\alpha\beta\,e^{\mu+\sigma^2/2}}{(\alpha-1)(\beta+1)}\right)^2,$$
using the moment formula with \(k=1\) and \(k=2\).
6.3 Likelihood (i.i.d. sample)#
For i.i.d. data \(x_1,\ldots,x_n\) (all \(>0\)), the log-likelihood is
$$\ell(\mu,\sigma,\alpha,\beta) = \sum_{i=1}^{n} \log f(x_i).$$
A convenient stable form (using the definitions from Section 3) is
$$\ell = n\bigl[\log\alpha + \log\beta - \log(\alpha+\beta)\bigr] - \sum_i \log x_i + \sum_i \log\phi(z_i) + \sum_i \log\bigl[R(y_{1,i}) + R(y_{2,i})\bigr],$$
where the last term is best computed with a logaddexp (log-sum-exp) trick.
def dpln_loglik(x: np.ndarray, mu: float, sigma: float, alpha: float, beta: float) -> float:
x = np.asarray(x, dtype=float)
return float(dpln_logpdf(x, mu, sigma, alpha, beta).sum())
# Quick check: likelihood prefers the true parameters on synthetic data.
mu0, sigma0, alpha0, beta0 = 0.1, 0.7, 3.5, 2.0
x = dpln_dist.rvs(mu0, sigma0, alpha0, beta0, size=2_000, random_state=rng)
ll_true = dpln_loglik(x, mu0, sigma0, alpha0, beta0)
ll_perturbed = dpln_loglik(x, mu0 + 0.3, sigma0 * 1.1, alpha0 * 0.8, beta0 * 1.3)
print("log-likelihood at true params =", ll_true)
print("log-likelihood at perturbed params=", ll_perturbed)
log-likelihood at true params = -2294.9214816192944
log-likelihood at perturbed params= -2600.229763380172
7) Sampling & Simulation#
NumPy-only algorithm#
Using the representation
$$X = \exp\!\Bigl(Z + \frac{E_1}{\alpha} - \frac{E_2}{\beta}\Bigr)$$
with \(Z\sim\mathcal{N}(\mu,\sigma^2)\) and \(E_1,E_2\sim\mathrm{Exp}(1)\) independent, sampling is straightforward:
Draw \(Z\) from a normal.
Draw \(E_1,E_2\) from independent exponentials.
Return \(X=\exp\left(Z + E_1/\alpha - E_2/\beta\right)\).
This is exactly the approach used internally by SciPy.
def dpln_rvs_numpy(
mu: float,
sigma: float,
alpha: float,
beta: float,
size: int | tuple[int, ...],
rng: np.random.Generator | None = None,
) -> np.ndarray:
"""NumPy-only sampler for the standard dPlN(μ,σ,α,β) distribution."""
if rng is None:
rng = np.random.default_rng()
Z = rng.normal(mu, sigma, size=size)
E1 = rng.standard_exponential(size=size)
E2 = rng.standard_exponential(size=size)
return np.exp(Z + E1 / alpha - E2 / beta)
mu, sigma, alpha, beta = 0.0, 0.6, 3.0, 2.0
x_np = dpln_rvs_numpy(mu, sigma, alpha, beta, size=200_000, rng=rng)
x_sp = dpln_dist.rvs(mu, sigma, alpha, beta, size=200_000, random_state=rng)
print("NumPy sampler vs SciPy sampler (same RNG stream not expected to match):")
print(" mean NumPy:", x_np.mean())
print(" mean SciPy:", x_sp.mean())
print(" median NumPy:", np.median(x_np))
print(" median SciPy:", np.median(x_sp))
NumPy sampler vs SciPy sampler (same RNG stream not expected to match):
mean NumPy: 1.1982726939077264
mean SciPy: 1.1967383826043556
median NumPy: 0.8688818370335867
median SciPy: 0.8725882969800218
8) Visualization#
We’ll look at:
the PDF (and Monte Carlo histogram)
the CDF (and an empirical CDF)
the tail on a log–log plot to reveal the Pareto slope
mu, sigma, alpha, beta = 0.0, 0.6, 3.0, 2.0
samples = dpln_rvs_numpy(mu, sigma, alpha, beta, size=60_000, rng=rng)
# Focus on the body up to a high quantile; tails are shown separately.
x_max = np.quantile(samples, 0.995)
x_grid = np.logspace(-3, np.log10(x_max), 600)
pdf_grid = dpln_dist.pdf(x_grid, mu, sigma, alpha, beta)
fig = go.Figure()
fig.add_histogram(
x=samples,
histnorm="probability density",
nbinsx=140,
name="Monte Carlo",
opacity=0.55,
)
fig.add_trace(
go.Scatter(x=x_grid, y=pdf_grid, mode="lines", name="PDF (theory)", line=dict(width=3))
)
fig.update_xaxes(type="log", title="x", range=[np.log10(1e-3), np.log10(x_max)])
fig.update_yaxes(title="density")
fig.update_layout(title="dpareto_lognorm: PDF + Monte Carlo histogram")
fig.show()
# CDF + empirical CDF
x_grid = np.logspace(-3, 3, 700)
cdf_grid = dpln_dist.cdf(x_grid, mu, sigma, alpha, beta)
xs = np.sort(samples)
ecdf = np.arange(1, xs.size + 1) / xs.size
fig = go.Figure()
fig.add_trace(go.Scatter(x=x_grid, y=cdf_grid, mode="lines", name="CDF (theory)", line=dict(width=3)))
fig.add_trace(
go.Scatter(x=xs, y=ecdf, mode="lines", name="empirical CDF", line=dict(width=2, dash="dot"))
)
fig.update_xaxes(type="log", title="x")
fig.update_yaxes(title="F(x)")
fig.update_layout(title="dpareto_lognorm: CDF vs empirical CDF")
fig.show()
# Tail plot: survival function on log-log scale
x_tail = np.logspace(0, 4, 200)
sf_tail = dpln_dist.sf(x_tail, mu, sigma, alpha, beta)
fig = go.Figure()
fig.add_trace(go.Scatter(x=x_tail, y=sf_tail, mode="lines", name="SF(x)=P(X>x)"))
# Reference slope ~ x^{-alpha}
C = sf_tail[0] * (x_tail[0] ** alpha)
ref = C * x_tail ** (-alpha)
fig.add_trace(
go.Scatter(x=x_tail, y=ref, mode="lines", name=f"reference ~ x^(-{alpha})", line=dict(dash="dash"))
)
fig.update_xaxes(type="log", title="x")
fig.update_yaxes(type="log", title="P(X>x)")
fig.update_layout(title="Right tail looks Pareto on log-log axes")
fig.show()
9) SciPy Integration#
SciPy provides scipy.stats.dpareto_lognorm with the usual distribution API:
pdf, logpdf, cdf, logcdf, sf, logsf, rvs, and fit (MLE).
You can also freeze a distribution with specific parameters to get a reusable object.
# Frozen distribution
mu, sigma, alpha, beta = 0.1, 0.7, 3.5, 2.0
dist = dpln_dist(mu, sigma, alpha, beta)
x = np.logspace(-2, 3, 6)
print("pdf:", dist.pdf(x))
print("cdf:", dist.cdf(x))
print("rvs:", dist.rvs(size=5, random_state=rng))
# Fitting example (MLE) on synthetic data.
true_params = (0.1, 0.7, 3.5, 2.0)
data = dpln_dist.rvs(*true_params, size=5_000, random_state=rng)
# Fix loc=0 and scale=1 to estimate only (u,s,a,b) in the standard form.
(u_hat, s_hat, a_hat, b_hat, loc_hat, scale_hat) = dpln_dist.fit(data, floc=0, fscale=1)
print()
print("True (u,s,a,b):", true_params)
print("Fit (u,s,a,b):", (u_hat, s_hat, a_hat, b_hat))
print("(loc, scale) fixed:", (loc_hat, scale_hat))
pdf: [0.027764 0.274118 0.462581 0.000945 0. 0. ]
cdf: [0.000139 0.013823 0.538674 0.997068 0.999999 1. ]
rvs: [3.412395 7.857961 2.926468 0.45226 0.671159]
True (u,s,a,b): (0.1, 0.7, 3.5, 2.0)
Fit (u,s,a,b): (0.10183273702285217, 0.7123237885111698, 3.6887988783026655, 2.1678064142366296)
(loc, scale) fixed: (0, 1)
/home/tempa/miniconda3/lib/python3.12/site-packages/scipy/stats/_continuous_distns.py:1963: RuntimeWarning:
divide by zero encountered in divide
10) Statistical Use Cases#
10.1 Hypothesis testing (goodness-of-fit)#
If parameters are specified in advance (not estimated on the same data), you can test whether data plausibly comes from a dPlN distribution using goodness-of-fit tests like Kolmogorov–Smirnov (KS).
Caveat: if you estimate parameters from the data and then run KS on the same data, usual p-values are not exact (you’d want a corrected procedure or a bootstrap).
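A minimal parametric-bootstrap sketch (our own construction; B is kept small here for speed):
# Sketch: parametric bootstrap calibrates the KS statistic when parameters
# are estimated from the same data.
import numpy as np
from scipy.stats import dpareto_lognorm, kstest
rng_bs = np.random.default_rng(7)
data_bs = dpareto_lognorm.rvs(0.0, 0.6, 3.0, 2.0, size=500, random_state=rng_bs)
u, s, a, b, _, _ = dpareto_lognorm.fit(data_bs, floc=0, fscale=1)
D_obs, _ = kstest(data_bs, dpareto_lognorm(u, s, a, b).cdf)
B = 50  # small for speed (a sketch); use B in the hundreds in practice
D_boot = np.empty(B)
for i in range(B):
    sim = dpareto_lognorm.rvs(u, s, a, b, size=data_bs.size, random_state=rng_bs)
    ui, si, ai, bi, _, _ = dpareto_lognorm.fit(sim, floc=0, fscale=1)
    D_boot[i], _ = kstest(sim, dpareto_lognorm(ui, si, ai, bi).cdf)
print("bootstrap p-value ≈", (1 + np.sum(D_boot >= D_obs)) / (1 + B))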
10.2 Bayesian modeling#
Because we can evaluate the likelihood \(p(x\mid\mu,\sigma,\alpha,\beta)\), we can place priors on the parameters and do Bayesian inference:
$$p(\mu,\sigma,\alpha,\beta \mid x) \propto p(x \mid \mu,\sigma,\alpha,\beta)\,p(\mu,\sigma,\alpha,\beta).$$
Below we show a simple 1D example: infer \(\mu\) with \((\sigma,\alpha,\beta)\) held fixed.
10.3 Generative modeling#
In simulation or synthetic-data generation, dPlN is a convenient way to generate lognormal-like samples with realistic heavy tails.
# Hypothesis testing example: KS test when parameters are known.
from scipy.stats import kstest
mu, sigma, alpha, beta = 0.0, 0.6, 3.0, 2.0
x = dpln_dist.rvs(mu, sigma, alpha, beta, size=3_000, random_state=rng)
D, p_value = kstest(x, dpln_dist(mu, sigma, alpha, beta).cdf)
print("KS test against dPlN(u=0, s=0.6, a=3, b=2):")
print(" D =", D)
print(" p-value=", p_value)
# Compare against a (mis-specified) lognormal with the same fitted parameters.
shape_ln, loc_ln, scale_ln = lognorm.fit(x, floc=0)
D_ln, p_ln = kstest(x, lognorm(shape_ln, loc_ln, scale_ln).cdf)
print()
print("KS test against fitted lognormal (p-value not exact due to fitting):")
print(" D =", D_ln)
print(" p-value=", p_ln)
KS test against dPlN(u=0, s=0.6, a=3, b=2):
D = 0.01289957414019094
p-value= 0.6952763982788772
KS test against fitted lognormal (p-value not exact due to fitting):
D = 0.03175003284745531
p-value= 0.004620185724241905
# Model comparison via AIC: dPlN vs lognormal on the same data.
# Fit dPlN (u,s,a,b) with loc/scale fixed.
(u_hat, s_hat, a_hat, b_hat, _, _) = dpln_dist.fit(x, floc=0, fscale=1)
ll_dpln = dpln_dist.logpdf(x, u_hat, s_hat, a_hat, b_hat).sum()
# Fit lognormal (shape, scale) with loc fixed.
(shape_ln, _, scale_ln) = lognorm.fit(x, floc=0)
ll_lognorm = lognorm.logpdf(x, shape_ln, loc=0, scale=scale_ln).sum()
k_dpln = 4 # u, s, a, b
k_lognorm = 2 # shape (sigma), scale (exp(mu))
AIC_dpln = 2 * k_dpln - 2 * ll_dpln
AIC_lognorm = 2 * k_lognorm - 2 * ll_lognorm
print("AIC (lower is better):")
print(" dPlN =", AIC_dpln)
print(" lognormal=", AIC_lognorm)
AIC (lower is better):
dPlN = 6550.366513886966
lognormal= 6654.193797013038
# Bayesian example: posterior for mu with (sigma, alpha, beta) fixed.
mu_true, sigma, alpha, beta = 0.0, 0.6, 3.0, 2.0
x = dpln_dist.rvs(mu_true, sigma, alpha, beta, size=600, random_state=rng)
mu_grid = np.linspace(-1.0, 1.0, 600)
# Prior: mu ~ Normal(0, 0.7^2)
mu0, tau = 0.0, 0.7
log_prior = norm.logpdf(mu_grid, loc=mu0, scale=tau)
# Vectorized log-likelihood over the grid.
log_like = dpln_dist.logpdf(x[:, None], mu_grid[None, :], sigma, alpha, beta).sum(axis=0)
log_post_unnorm = log_prior + log_like
log_post = log_post_unnorm - special.logsumexp(log_post_unnorm)
post = np.exp(log_post)
fig = go.Figure()
fig.add_trace(go.Scatter(x=mu_grid, y=post, mode="lines", name="posterior p(mu|x)"))
fig.add_vline(x=mu_true, line_dash="dash", line_color="black", annotation_text="true mu")
fig.update_xaxes(title="mu")
fig.update_yaxes(title="posterior density")
fig.update_layout(title="Bayesian inference for mu (sigma, alpha, beta fixed)")
fig.show()
# Generative modeling: how alpha changes extremes (same mu,sigma,beta).
mu, sigma, beta = 0.0, 0.6, 2.0
alphas = [1.5, 3.0, 8.0]
N = 50_000
summary = []
for a in alphas:
xs = dpln_dist.rvs(mu, sigma, a, beta, size=N, random_state=rng)
summary.append(
{
"alpha": a,
"median": float(np.median(xs)),
"q95": float(np.quantile(xs, 0.95)),
"q99": float(np.quantile(xs, 0.99)),
"max": float(xs.max()),
}
)
summary
[{'alpha': 1.5,
'median': 1.140906784453676,
'q95': 6.5539715901899305,
'q99': 19.640716249820006,
'max': 2277.2901435545964},
{'alpha': 3.0,
'median': 0.8734973177090273,
'q95': 3.1817893858567854,
'q99': 5.771254078804477,
'max': 88.8552624536584},
{'alpha': 8.0,
'median': 0.7198174084958109,
'q95': 2.2711930633331923,
'q99': 3.5614201502442215,
'max': 15.600886798001836}]
11) Pitfalls#
Parameter constraints: require \(\sigma>0\), \(\alpha>0\), \(\beta>0\). The support is \(x>0\) in the standard form.
Moment existence: mean/variance/skewness/kurtosis require \(\alpha>1,2,3,4\) respectively; otherwise these summaries are infinite/undefined.
Numerical stability:
tails can underflow in pdf/cdf; prefer logpdf, logcdf, logsf.
computing \(R(t)=(1-\Phi(t))/\phi(t)\) naively can overflow/underflow; use stable log-space computations.
Fitting (MLE):
heavy tails can make optimization sensitive; check diagnostics (Q–Q plots, tail plots).
parameters can trade off (e.g., \(\sigma\) vs tail indices), so estimates can be noisy with limited data.
KS test after fitting: standard p-values are not exact when parameters are estimated from the same sample.
12) Summary#
dpareto_lognorm is a continuous distribution on \((0,\infty)\) with a lognormal-like bulk and Pareto tails.
The tail index \(\alpha\) controls upper extremes and determines which positive moments exist.
Moments have a clean closed form via the representation \(\log X = Z + E_1/\alpha - E_2/\beta\).
Sampling is easy with NumPy (normal + exponentials), and SciPy provides a full API including fit.
References#
Reed, W. J., & Jorgensen, M. (2004). The double Pareto-lognormal distribution — a new parametric model for size distributions. Communications in Statistics — Theory and Methods.
Hajargasht, G., & Griffiths, W. E. (2013). Pareto-lognormal distributions: Inequality, poverty, and estimation from grouped income data. Economic Modelling.
SciPy documentation: scipy.stats.dpareto_lognorm.